<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>HTML Sanitization [Universal Feed Parser]</title>
<link rel="stylesheet" href="feedparser.css" type="text/css">
<link rev="made" href="mailto:mark@diveintomark.org">
<meta name="generator" content="DocBook XSL Stylesheets V1.65.1">
<meta name="keywords" content="RSS, Atom, CDF, XML, feed, parser, Python">
<link rel="start" href="index.html" title="Documentation">
<link rel="up" href="advanced.html" title="Advanced Features">
<link rel="prev" href="date-parsing.html" title="Date Parsing">
<link rel="next" href="content-normalization.html" title="Content Normalization">
</head>
<body id="feedparser-org" class="docs">
<div class="z" id="intro"><div class="sectionInner"><div class="sectionInner2">
<div class="s" id="pageHeader">
<h1><a href="/"><span>Universal Feed Parser</span></a></h1>
<p><span>Parse RSS and Atom feeds in Python.  3000 unit tests.  Open source.</span></p>
</div>
</div></div></div>
<div id="main"><div id="mainInner">
<div class="section" lang="en">
<div class="titlepage">
<div><div><h2 class="title">
<a name="advanced.sanitization" class="skip" href="#advanced.sanitization" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> <acronym title="HyperText Markup Language">HTML</acronym> Sanitization</h2></div></div>
<div></div>
</div>
<div class="abstract"><p>Many feed elements can contain <acronym title="HyperText Markup Language">HTML</acronym> markup, and many feed aggregators use a web browser (or browser component) to display content.  By default, <span class="application">Universal Feed Parser</span> sanitizes <acronym title="HyperText Markup Language">HTML</acronym> markup in several elements, removing <acronym title="HyperText Markup Language">HTML</acronym> tags and attributes that could introduce JavaScript or other security risks.</p></div>
<p>These elements are sanitized by default:</p>
<div class="itemizedlist"><ul>
<li><a href="reference-feed-title.html" title="feed.title">feed.title</a></li>
<li><a href="reference-feed-subtitle.html" title="feed.subtitle">feed.subtitle</a></li>
<li><a href="reference-feed-info.html" title="feed.info">feed.info</a></li>
<li><a href="reference-feed-rights.html" title="feed.rights">feed.rights</a></li>
<li><a href="reference-entry-title.html" title="entries[i].title">entries[i].title</a></li>
<li><a href="reference-entry-summary.html" title="entries[i].summary">entries[i].summary</a></li>
<li><a href="reference-entry-content.html" title="entries[i].content">entries[i].content</a></li>
</ul></div>
<p>The following <acronym title="HyperText Markup Language">HTML</acronym> tags are allowed by default (all others are stripped):
<span class="simplelist"><tt class="sgmltag-element">a</tt>, <tt class="sgmltag-element">abbr</tt>, <tt class="sgmltag-element">acronym</tt>, <tt class="sgmltag-element">address</tt>, <tt class="sgmltag-element">area</tt>, <tt class="sgmltag-element">b</tt>, <tt class="sgmltag-element">big</tt>, <tt class="sgmltag-element">blockquote</tt>, <tt class="sgmltag-element">br</tt>, <tt class="sgmltag-element">button</tt>, <tt class="sgmltag-element">caption</tt>, <tt class="sgmltag-element">center</tt>, <tt class="sgmltag-element">cite</tt>, <tt class="sgmltag-element">code</tt>, <tt class="sgmltag-element">col</tt>, <tt class="sgmltag-element">colgroup</tt>, <tt class="sgmltag-element">dd</tt>, <tt class="sgmltag-element">del</tt>, <tt class="sgmltag-element">dfn</tt>, <tt class="sgmltag-element">dir</tt>, <tt class="sgmltag-element">div</tt>, <tt class="sgmltag-element">dl</tt>, <tt class="sgmltag-element">dt</tt>, <tt class="sgmltag-element">em</tt>, <tt class="sgmltag-element">fieldset</tt>, <tt class="sgmltag-element">font</tt>, <tt class="sgmltag-element">form</tt>, <tt class="sgmltag-element">h1</tt>, <tt class="sgmltag-element">h2</tt>, <tt class="sgmltag-element">h3</tt>, <tt class="sgmltag-element">h4</tt>, <tt class="sgmltag-element">h5</tt>, <tt class="sgmltag-element">h6</tt>, <tt class="sgmltag-element">hr</tt>, <tt class="sgmltag-element">i</tt>, <tt class="sgmltag-element">img</tt>, <tt class="sgmltag-element">input</tt>, <tt class="sgmltag-element">ins</tt>, <tt class="sgmltag-element">kbd</tt>, <tt class="sgmltag-element">label</tt>, <tt class="sgmltag-element">legend</tt>, <tt class="sgmltag-element">li</tt>, <tt class="sgmltag-element">map</tt>, <tt class="sgmltag-element">menu</tt>, <tt class="sgmltag-element">ol</tt>, <tt class="sgmltag-element">optgroup</tt>, <tt class="sgmltag-element">option</tt>, <tt class="sgmltag-element">p</tt>, <tt class="sgmltag-element">pre</tt>, <tt class="sgmltag-element">q</tt>, <tt class="sgmltag-element">s</tt>, <tt 
class="sgmltag-element">samp</tt>, <tt class="sgmltag-element">select</tt>, <tt class="sgmltag-element">small</tt>, <tt class="sgmltag-element">span</tt>, <tt class="sgmltag-element">strike</tt>, <tt class="sgmltag-element">strong</tt>, <tt class="sgmltag-element">sub</tt>, <tt class="sgmltag-element">sup</tt>, <tt class="sgmltag-element">table</tt>, <tt class="sgmltag-element">tbody</tt>, <tt class="sgmltag-element">td</tt>, <tt class="sgmltag-element">textarea</tt>, <tt class="sgmltag-element">tfoot</tt>, <tt class="sgmltag-element">th</tt>, <tt class="sgmltag-element">thead</tt>, <tt class="sgmltag-element">tr</tt>, <tt class="sgmltag-element">tt</tt>, <tt class="sgmltag-element">u</tt>, <tt class="sgmltag-element">ul</tt>, <tt class="sgmltag-element">var</tt></span>
</p>
<p>The following <acronym title="HyperText Markup Language">HTML</acronym> attributes are allowed by default (all others are stripped):
<span class="simplelist"><tt class="sgmltag-attribute">abbr</tt>, <tt class="sgmltag-attribute">accept</tt>, <tt class="sgmltag-attribute">accept-charset</tt>, <tt class="sgmltag-attribute">accesskey</tt>, <tt class="sgmltag-attribute">action</tt>, <tt class="sgmltag-attribute">align</tt>, <tt class="sgmltag-attribute">alt</tt>, <tt class="sgmltag-attribute">axis</tt>, <tt class="sgmltag-attribute">border</tt>, <tt class="sgmltag-attribute">cellpadding</tt>, <tt class="sgmltag-attribute">cellspacing</tt>, <tt class="sgmltag-attribute">char</tt>, <tt class="sgmltag-attribute">charoff</tt>, <tt class="sgmltag-attribute">charset</tt>, <tt class="sgmltag-attribute">checked</tt>, <tt class="sgmltag-attribute">cite</tt>, <tt class="sgmltag-attribute">class</tt>, <tt class="sgmltag-attribute">clear</tt>, <tt class="sgmltag-attribute">cols</tt>, <tt class="sgmltag-attribute">colspan</tt>, <tt class="sgmltag-attribute">color</tt>, <tt class="sgmltag-attribute">compact</tt>, <tt class="sgmltag-attribute">coords</tt>, <tt class="sgmltag-attribute">datetime</tt>, <tt class="sgmltag-attribute">dir</tt>, <tt class="sgmltag-attribute">disabled</tt>, <tt class="sgmltag-attribute">enctype</tt>, <tt class="sgmltag-attribute">for</tt>, <tt class="sgmltag-attribute">frame</tt>, <tt class="sgmltag-attribute">headers</tt>, <tt class="sgmltag-attribute">height</tt>, <tt class="sgmltag-attribute">href</tt>, <tt class="sgmltag-attribute">hreflang</tt>, <tt class="sgmltag-attribute">hspace</tt>, <tt class="sgmltag-attribute">id</tt>, <tt class="sgmltag-attribute">ismap</tt>, <tt class="sgmltag-attribute">label</tt>, <tt class="sgmltag-attribute">lang</tt>, <tt class="sgmltag-attribute">longdesc</tt>, <tt class="sgmltag-attribute">maxlength</tt>, <tt class="sgmltag-attribute">media</tt>, <tt class="sgmltag-attribute">method</tt>, <tt class="sgmltag-attribute">multiple</tt>, <tt class="sgmltag-attribute">name</tt>, <tt class="sgmltag-attribute">nohref</tt>, <tt 
class="sgmltag-attribute">noshade</tt>, <tt class="sgmltag-attribute">nowrap</tt>, <tt class="sgmltag-attribute">prompt</tt>, <tt class="sgmltag-attribute">readonly</tt>, <tt class="sgmltag-attribute">rel</tt>, <tt class="sgmltag-attribute">rev</tt>, <tt class="sgmltag-attribute">rows</tt>, <tt class="sgmltag-attribute">rowspan</tt>, <tt class="sgmltag-attribute">rules</tt>, <tt class="sgmltag-attribute">scope</tt>, <tt class="sgmltag-attribute">selected</tt>, <tt class="sgmltag-attribute">shape</tt>, <tt class="sgmltag-attribute">size</tt>, <tt class="sgmltag-attribute">span</tt>, <tt class="sgmltag-attribute">src</tt>, <tt class="sgmltag-attribute">start</tt>, <tt class="sgmltag-attribute">summary</tt>, <tt class="sgmltag-attribute">tabindex</tt>, <tt class="sgmltag-attribute">target</tt>, <tt class="sgmltag-attribute">title</tt>, <tt class="sgmltag-attribute">type</tt>, <tt class="sgmltag-attribute">usemap</tt>, <tt class="sgmltag-attribute">valign</tt>, <tt class="sgmltag-attribute">value</tt>, <tt class="sgmltag-attribute">vspace</tt>, <tt class="sgmltag-attribute">width</tt></span>
</p>
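<p>As a rough illustration of how the two whitelists above work together, here is a minimal sketch built on the standard <tt class="literal">html.parser</tt> module in Python 3.  It is not <span class="application">Universal Feed Parser</span>'s actual implementation: the class name <tt class="literal">WhitelistSanitizer</tt>, the function <tt class="literal">sanitize</tt>, and the abbreviated whitelists are assumptions made for this example.</p>
<pre class="programlisting ">from html import escape
from html.parser import HTMLParser

# Abbreviated subsets of the lists above, for illustration only.
ALLOWED_TAGS = {"a", "b", "em", "i", "p", "span", "strong"}
ALLOWED_ATTRS = {"class", "href", "rel", "title"}


class WhitelistSanitizer(HTMLParser):
    """Hypothetical sketch: re-emit only whitelisted tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag itself
        kept = "".join(
            ' %s="%s"' % (name, escape(value or ""))
            for name, value in attrs
            if name in ALLOWED_ATTRS  # unknown attributes are dropped
        )
        self.out.append("&lt;%s%s&gt;" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("&lt;/%s&gt;" % tag)

    def handle_data(self, data):
        # Text is kept, re-escaped so it cannot be read as markup.
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(sanitize('&lt;span style="color: red" class="note"&gt;hi&lt;/span&gt;'))
# prints: &lt;span class="note"&gt;hi&lt;/span&gt;</pre>
<p>In this sketch, the text inside a dropped element is kept as escaped text.  A production sanitizer would likely need to treat the contents of elements such as <tt class="sgmltag-element">script</tt> more carefully.</p>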
<a name="id4956096"></a><table class="note" border="0" summary="">
<tr><td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td></tr>
<tr><td colspan="2" align="left" valign="top" width="99%">The <a href="http://feedparser.org/tests/wellformed/sanitize/">unit tests for <acronym title="HyperText Markup Language">HTML</acronym> sanitizing</a> show many different examples of dangerous markup that <span class="application">Universal Feed Parser</span> sanitizes by default.</td></tr>
</table>
<p>One emerging technology that affects feed parsing is the inclusion of <a href="http://microformats.org/">microformats</a> within syndicated content.  Briefly, publishers can add semantics to their <acronym title="HyperText Markup Language">HTML</acronym> content using <tt class="sgmltag-attribute">rel</tt> and <tt class="sgmltag-attribute">class</tt> attributes.  <span class="application">Universal Feed Parser</span> does not currently parse microformat content within embedded <acronym title="HyperText Markup Language">HTML</acronym> markup, but it doesn't destroy it either.  Both the <tt class="sgmltag-attribute">rel</tt> and <tt class="sgmltag-attribute">class</tt> attributes survive <acronym title="HyperText Markup Language">HTML</acronym> sanitizing, so applications built on <span class="application">Universal Feed Parser</span> that wish to parse microformat content are free to do so.</p>
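<p>Because <tt class="sgmltag-attribute">rel</tt> survives sanitizing, an application could extract microformat data from the sanitized markup itself.  The following is only a sketch under stated assumptions: the class <tt class="literal">RelTagCollector</tt> and the function <tt class="literal">collect_rel_tags</tt> are hypothetical names, and the input string stands in for a sanitized value such as <tt class="literal">entries[i].summary</tt>.</p>
<pre class="programlisting ">from html.parser import HTMLParser


class RelTagCollector(HTMLParser):
    """Collect the href of every link whose rel attribute includes "tag"."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # rel may hold several space-separated values.
        rels = (attrs.get("rel") or "").split()
        if tag == "a" and "tag" in rels and attrs.get("href"):
            self.tags.append(attrs["href"])


def collect_rel_tags(sanitized_html):
    collector = RelTagCollector()
    collector.feed(sanitized_html)
    collector.close()
    return collector.tags


# Hypothetical sanitized content, as it might appear in entries[i].summary.
summary = '&lt;a href="http://example.org/tags/python" rel="tag"&gt;python&lt;/a&gt;'
print(collect_rel_tags(summary))
# prints: ['http://example.org/tags/python']</pre>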
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="advanced.sanitization.why" class="skip" href="#advanced.sanitization.why" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Whitelist, Don't Blacklist</h3></div></div>
<div></div>
</div>
<p>I am often asked why <span class="application">Universal Feed Parser</span> is so hard-assed about <acronym title="HyperText Markup Language">HTML</acronym> sanitizing.  This topic usually comes up when someone notices that <span class="application">Universal Feed Parser</span> strips all <tt class="sgmltag-attribute">style</tt> attributes by default.</p>
<p>Here is an incomplete list of potentially dangerous <acronym title="HyperText Markup Language">HTML</acronym> tags and attributes:</p>
<div class="itemizedlist"><ul>
<li>
<tt class="sgmltag-element">script</tt>, which can contain malicious script</li>
<li>
<tt class="sgmltag-element">applet</tt>, <tt class="sgmltag-element">embed</tt>, and <tt class="sgmltag-element">object</tt>, which can automatically download and execute malicious code</li>
<li>
<tt class="sgmltag-element">meta</tt>, which can contain malicious redirects</li>
<li>
<tt class="sgmltag-attribute">onload</tt>, <tt class="sgmltag-attribute">onunload</tt>, and all other <tt class="sgmltag-attribute">on*</tt> attributes, which can contain malicious script</li>
<li>
<tt class="sgmltag-element">style</tt>, <tt class="sgmltag-element">link</tt>, and the <tt class="sgmltag-attribute">style</tt> attribute, which can contain malicious script</li>
</ul></div>
<p><span class="emphasis"><em><tt class="sgmltag-attribute">style</tt>?</em></span>  Yes, <tt class="sgmltag-attribute">style</tt>.  <acronym title="Cascading Style Sheets">CSS</acronym> definitions can contain executable code.</p>
<div class="example">
<a name="example.javascript" class="skip" href="#example.javascript" title="link to this example"><img src="images/permalink.gif" alt="[link]" title="link to this example" width="8" height="9"></a> <h3 class="title">Example: Embedding JavaScript in <acronym title="Cascading Style Sheets">CSS</acronym></h3>
<p>This sample is taken from <a href="http://feedparser.org/docs/examples/rss20.xml">http://feedparser.org/docs/examples/rss20.xml</a>:</p>
<pre class="programlisting ">
&lt;description&gt;Watch out for
&amp;lt;span style="background: url(javascript:window.location='http://example.org/')"&amp;gt;
nasty tricks&amp;lt;/span&amp;gt;&lt;/description&gt;</pre>
<p>This sample is more advanced, and does not contain the keyword <tt class="literal">javascript:</tt> that many naive <acronym title="HyperText Markup Language">HTML</acronym> sanitizers scan for:</p>
<pre class="programlisting ">&lt;description&gt;Watch out for
&amp;lt;span style="any: expression(window.location='http://example.org/')"&amp;gt;
nasty tricks&amp;lt;/span&amp;gt;&lt;/description&gt;</pre>
<p>Internet Explorer for Windows will execute the JavaScript in both of these examples.</p>
</div>
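<p>To see why scanning for a keyword falls short, consider a deliberately naive blacklist applied to the two <tt class="sgmltag-element">span</tt> tags from the samples above.  The pattern and the function name <tt class="literal">looks_dangerous</tt> are hypothetical, chosen only for this illustration.</p>
<pre class="programlisting ">import re

# A deliberately naive blacklist: flag markup that mentions "javascript:".
NAIVE_BLACKLIST = re.compile(r"javascript:", re.IGNORECASE)


def looks_dangerous(markup):
    return bool(NAIVE_BLACKLIST.search(markup))


first = '&lt;span style="background: url(javascript:window.location=\'http://example.org/\')"&gt;'
second = '&lt;span style="any: expression(window.location=\'http://example.org/\')"&gt;'

print(looks_dangerous(first))   # True: the keyword is present
print(looks_dangerous(second))  # False: there is no keyword to match</pre>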
<p>Now consider that in <acronym title="HyperText Markup Language">HTML</acronym>, attribute values may be entity-encoded in several different ways.</p>
<div class="example">
<a name="example.javascript.encoded" class="skip" href="#example.javascript.encoded" title="link to this example"><img src="images/permalink.gif" alt="[link]" title="link to this example" width="8" height="9"></a> <h3 class="title">Example: Embedding encoded JavaScript in <acronym title="Cascading Style Sheets">CSS</acronym></h3>
<p>To a browser, this:</p>
<pre class="programlisting ">&lt;span style="any: expression(window.location='http://example.org/')"&gt;</pre>
<p>is the same as this (without the line breaks):</p>
<pre class="programlisting ">&lt;span style="&amp;#97;&amp;#110;&amp;#121;&amp;#58;&amp;#32;&amp;#101;&amp;#120;&amp;#112;&amp;#114;&amp;#101;
&amp;#115;&amp;#115;&amp;#105;&amp;#111;&amp;#110;&amp;#40;&amp;#119;&amp;#105;&amp;#110;&amp;#100;&amp;#111;&amp;#119;
&amp;#46;&amp;#108;&amp;#111;&amp;#99;&amp;#97;&amp;#116;&amp;#105;&amp;#111;&amp;#110;&amp;#61;&amp;#39;&amp;#104;
&amp;#116;&amp;#116;&amp;#112;&amp;#58;&amp;#47;&amp;#47;&amp;#101;&amp;#120;&amp;#97;&amp;#109;&amp;#112;&amp;#108;
&amp;#101;&amp;#46;&amp;#111;&amp;#114;&amp;#103;&amp;#47;&amp;#39;&amp;#41;"&gt;</pre>
<p>which is the same as this (without the line breaks):</p>
<pre class="programlisting ">&lt;span style="&amp;#x61;&amp;#x6e;&amp;#x79;&amp;#x3a;&amp;#x20;&amp;#x65;&amp;#x78;&amp;#x70;&amp;#x72;
&amp;#x65;&amp;#x73;&amp;#x73;&amp;#x69;&amp;#x6f;&amp;#x6e;&amp;#x28;&amp;#x77;&amp;#x69;&amp;#x6e;
&amp;#x64;&amp;#x6f;&amp;#x77;&amp;#x2e;&amp;#x6c;&amp;#x6f;&amp;#x63;&amp;#x61;&amp;#x74;&amp;#x69;
&amp;#x6f;&amp;#x6e;&amp;#x3d;&amp;#x27;&amp;#x68;&amp;#x74;&amp;#x74;&amp;#x70;&amp;#x3a;&amp;#x2f;
&amp;#x2f;&amp;#x65;&amp;#x78;&amp;#x61;&amp;#x6d;&amp;#x70;&amp;#x6c;&amp;#x65;&amp;#x2e;&amp;#x6f;
&amp;#x72;&amp;#x67;&amp;#x2f;&amp;#x27;&amp;#x29;"&gt;</pre>
<p>And so on, plus several other variations, plus every combination of every variation.</p>
</div>
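<p>The equivalence of these encodings can be checked with the standard <tt class="literal">html.unescape</tt> function in Python 3.  This is a small sketch, and the variable names are assumptions made for the example.  It builds the decimal and hexadecimal forms of the attribute value shown above, confirms that both decode to the same text, and shows that a scan of the raw text for <tt class="literal">expression</tt> finds only the unencoded form.</p>
<pre class="programlisting ">from html import unescape

plain = "any: expression(window.location='http://example.org/')"

# Build the decimal and hexadecimal character-reference forms of the same text.
decimal = "".join("&amp;#%d;" % ord(ch) for ch in plain)
hexadecimal = "".join("&amp;#x%x;" % ord(ch) for ch in plain)

# Both encoded forms decode to the same attribute value...
assert unescape(decimal) == plain
assert unescape(hexadecimal) == plain

# ...but a scan of the raw text for "expression" only catches the plain form.
for raw in (plain, decimal, hexadecimal):
    print("expression" in raw)
# prints: True, then False, then False</pre>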
<p>The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it.  This is why <span class="application">Universal Feed Parser</span> uses a whitelist and not a blacklist.   I am reasonably confident that none of the elements or attributes on the whitelist are security risks.  I am not at all confident about elements or attributes that I have not explicitly investigated.  And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code.  I will not attempt to preserve “<span class="quote">just the good styles</span>”.  All styles are stripped.</p>
<div class="furtherreading">
<h3>Elsewhere</h3>
<ul><li><a href="http://diveintomark.org/archives/2003/06/12/how_to_consume_rss_safely">How to consume RSS safely</a></li></ul>
</div>
</div>
</div>
<hr style="clear:both">
<div class="footer"><p class="copyright">Copyright © 2004, 2005, 2006 Mark Pilgrim</p></div>
</div></div>
</body>
</html>
